- SMILES notation
- ===============
-
- This introduction to SMILES notation is based on a more detailed
- description in the following paper,
-
- "SMILES, a chemical language and information system", D. Weininger, Journal
- of Chemical Information and Computer Sciences, 28 (1988) pp31-36.
-
- SMILES (Simple Molecular Input Line Entry System) notation allows the
- two-dimensional graph of a molecule (and certain aspects of its
- three-dimensional structure) to be written as a concise, one-dimensional string
- of characters. This allows computers to store large numbers of chemical
- structures in a small space and also enables them to be processed extremely
- quickly. The beauty of SMILES, as compared to other encoding systems you
- could devise, is that humans can also quite easily look at a SMILES string
- and determine what molecule it represents and, conversely, easily construct
- a SMILES string that represents a given structure.
-
- A SMILES notation is a sequence of characters that ends with a white
- space. Hydrogens may be omitted or included. Aromatic structures can be
- specified directly or in their Kekulé form.
-
-
- Atoms
- =====
-
- Atoms are represented by their atomic symbols. Atoms in the "organic
- subset" (B, C, N, O, P, S, F, Cl, Br, and I) may be written as bare symbols;
- atoms of any other element must be enclosed in square brackets. The presence of
- enough hydrogen atoms to fill up any unused bonds to an atom is implied
- unless the atom symbol is enclosed in square brackets and the number of
- attached hydrogens is explicitly stated. Charges on an atom, if present, may
- also be specified in the square brackets.
-
- For example
-
- SMILES              Molecule
-
- C                   methane
- N                   ammonia
- O                   water
- [Au]                elemental gold
- [OH-]               hydroxide anion
- [OH3+]              hydronium cation
- [Fe+2] or [Fe++]    iron (II) cation
- [NH4+]              ammonium cation
-
- Atoms in aromatic rings are specified by lower case letters, e.g. 'c' for
- an aromatic carbon atom.
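
The bracket rules above can be sketched in a few lines of Python. This is only an illustration, not a full SMILES reader: the name parse_bracket_atom and the regular expression are assumptions made for this example, and isotopes, chirality marks and other bracket features are deliberately left out.

```python
import re

# Simplified bracket atom: element symbol, optional hydrogen count,
# optional charge. Isotopes and chirality marks are NOT handled.
BRACKET_ATOM = re.compile(
    r"^\[(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<h>H\d*)?"
    r"(?P<charge>\+\d+|-\d+|\++|-+)?\]$"
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen count, charge) for a bracket atom like '[NH4+]'."""
    match = BRACKET_ATOM.match(text)
    if match is None:
        raise ValueError("not a (simple) bracket atom: %r" % text)

    symbol = match.group("symbol")

    # 'H' alone means one hydrogen, 'H3' means three, no 'H' means none.
    h = match.group("h")
    if h is None:
        hcount = 0
    elif len(h) == 1:
        hcount = 1
    else:
        hcount = int(h[1:])

    # '+2' and '++' are two spellings of the same charge.
    c = match.group("charge")
    if c is None:
        charge = 0
    else:
        sign = 1 if c[0] == "+" else -1
        magnitude = int(c[1:]) if c[1:].isdigit() else len(c)
        charge = sign * magnitude

    return symbol, hcount, charge


for text in ["[Au]", "[OH-]", "[OH3+]", "[Fe+2]", "[Fe++]", "[NH4+]"]:
    print(text, parse_bracket_atom(text))
```

Both spellings of the iron (II) cation give the same result, ('Fe', 0, 2).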
-
-
- Bonds
- =====
-
- Single, double, triple, and aromatic bonds are represented by the symbols
- '-', '=', '#', and ':' respectively. Single and aromatic bond symbols may
- be, and usually are, omitted.
-
- Examples
-
- CC                  ethane
- C=C                 ethene
- CCO                 ethanol
- C#N                 hydrogen cyanide
- [H][H]              molecular hydrogen
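
To see how bond symbols and implied hydrogens work together, here is a rough Python sketch that counts the implied hydrogens on each atom of a simple chain. The function name implicit_hydrogens and the VALENCES table are assumptions for this example (the valences are the usual ones for the organic subset). It only handles unbranched chains of organic subset atoms, so brackets, branches, rings and aromatic atoms are rejected.

```python
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

# Usual valences of the organic subset; the lowest one that fits is used.
VALENCES = {
    "B": [3], "C": [4], "N": [3, 5], "O": [2], "P": [3, 5],
    "S": [2, 4, 6], "F": [1], "Cl": [1], "Br": [1], "I": [1],
}


def implicit_hydrogens(smiles):
    """Return [(symbol, implied H count), ...] for an unbranched chain."""
    atoms = []    # element symbols, in order of appearance
    used = []     # sum of bond orders already used by each atom
    pending = 1   # order of the next bond; single if no symbol is written
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
            i += 1
            continue
        symbol = smiles[i:i + 2] if smiles[i:i + 2] in ("Cl", "Br") else ch
        if symbol not in VALENCES:
            raise ValueError("unsupported character: %r" % ch)
        if atoms:
            # Bond the new atom to the previous one in the chain.
            used[-1] += pending
            used.append(pending)
        else:
            used.append(0)
        atoms.append(symbol)
        pending = 1
        i += len(symbol)

    result = []
    for symbol, bonds in zip(atoms, used):
        # Fill up to the lowest valence that is not already exceeded.
        target = next((v for v in VALENCES[symbol] if v >= bonds), bonds)
        result.append((symbol, target - bonds))
    return result


print(implicit_hydrogens("CCO"))   # ethanol
print(implicit_hydrogens("C#N"))   # hydrogen cyanide
```

For ethanol this gives three, two and one hydrogens on the C, C and O, i.e. CH3-CH2-OH.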
-
-
- Branches
- ========
-
- Branches off the main chain are enclosed in parentheses.
-
- For example (these and the following more complicated structures are
- drawn out in the file 'SMILESegs'):
-
- CCN(CC)CC           triethylamine
- CC(=O)O             ethanoic acid
-
- Branches may be nested, for example
-
- C=CC(CCC)C(C(C)C)CCC
-
- is a perfectly valid SMILES string.
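
Branches are naturally read with a stack: an opening parenthesis remembers the atom the branch hangs off, and a closing parenthesis goes back to it. The sketch below shows the idea. The function name parse_branches is an assumption for this example, and to keep it short only single-letter, upper case atoms and the bond symbols '-', '=' and '#' are accepted.

```python
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_branches(smiles):
    """Return (atoms, bonds) where bonds is a list of (from, to, order)."""
    atoms = []     # one symbol per atom, indexed by order of appearance
    bonds = []
    prev = None    # index of the atom the next atom will be bonded to
    stack = []     # branch points waiting for their ')'
    order = 1      # order of the next bond; single unless stated
    for ch in smiles:
        if ch == "(":
            stack.append(prev)       # remember where the branch starts
        elif ch == ")":
            prev = stack.pop()       # go back to the branch point
        elif ch in BOND_ORDER:
            order = BOND_ORDER[ch]
        elif ch.isalpha() and ch.isupper():
            atoms.append(ch)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, order))
            prev = index
            order = 1
        else:
            raise ValueError("unsupported character: %r" % ch)
    if stack:
        raise ValueError("unbalanced parentheses")
    return atoms, bonds


atoms, bonds = parse_branches("CC(=O)O")   # ethanoic acid
print(atoms)
print(bonds)
```

For ethanoic acid the second carbon (index 1) ends up bonded to the first carbon, doubly bonded to one oxygen, and singly bonded to the other.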
-
-
- Cyclic structures
- =================
-
- Rings are first converted to linear structures by breaking a single (or
- aromatic) bond. The SMILES for the resulting linear structure is then
- written as normal except that a ring closure number is added after each of
- the two atoms that had the bond between them broken.
-
- For example (remembering that lower case atom symbols imply aromaticity):
-
- C1CCCCC1            cyclohexane
- c1ccccc1            benzene
- C1C=CC=C1           cyclopentadiene
- Oc1ccccc1           phenol
- Brc1cc(Br)cc(Br)c1  1,3,5-tribromobenzene
-
- There may be more than one way of writing the structure as a SMILES. For
- example 1-methyl-3-bromo-cyclohexene may be written as
-
- CC1=CC(Br)CCC1
-
- or as
-
- CC1=CC(CCC1)Br
-
- An individual atom may be involved in closing more than one ring, as in
- cubane. In this case all the ring closure numbers associated with the atom
- are written after it.
-
- So cubane may be written as
-
- C12C3C4C1C5C4C3C25
-
- Ring closure digits may be reused once the ring they label has been
- closed. More than 9 may still be needed, however; ring closure numbers of 10
- or greater are written as two digits preceded by a '%' symbol.
-
- To illustrate both these things
-
- C%12CCCCC%12N=NC%12CCCCC%12
-
- represents two cyclohexane rings joined by a two nitrogen atom linker.
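
The pairing of ring closure numbers can be sketched as follows. The function name ring_closures is an assumption for this example. The first time a number is seen it opens a ring on the current atom, and the second time it closes that ring, after which the number is free to be reused. Bracket atoms are not handled, since digits inside brackets would be misread.

```python
def ring_closures(smiles):
    """Return [(opening atom, closing atom, ring number), ...]."""
    open_rings = {}   # ring number -> index of the atom that opened it
    closures = []
    atom = -1         # index of the most recently read atom
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "%":
            # Two-digit ring closure number, e.g. %12
            number = int(smiles[i + 1:i + 3])
            i += 3
        elif ch.isdigit():
            number = int(ch)
            i += 1
        else:
            if ch.isalpha():
                # Cl and Br are single atoms spelled with two letters.
                if smiles[i:i + 2] in ("Cl", "Br"):
                    i += 1
                atom += 1
            # Bond symbols and parentheses are simply skipped here.
            i += 1
            continue
        if number in open_rings:
            closures.append((open_rings.pop(number), atom, number))
        else:
            open_rings[number] = atom
    if open_rings:
        raise ValueError("unclosed ring numbers: %s" % sorted(open_rings))
    return closures


print(ring_closures("C12C3C4C1C5C4C3C25"))            # cubane
print(ring_closures("C%12CCCCC%12N=NC%12CCCCC%12"))   # number 12 used twice
```

Cubane gives five ring closure bonds which, added to the seven bonds along the chain, make up the twelve edges of the cube.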
-
-
- Disconnected structures
- =======================
-
- Disconnected structures are written as individual SMILES separated by a
- full stop. For example sodium phenoxide can be written as
-
- [Na+].[O-]c1ccccc1
-
- or even
-
- c1cc([O-].[Na+])ccc1
-
- Note, however, that no association of ions is implied by the order in
- which disconnected structures appear in the SMILES.
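
Because a full stop may appear inside a branch, simply splitting the string at each '.' does not always give valid SMILES fragments. The sketch below instead treats '.' as "no bond to the next atom" and counts the connected pieces with a small union-find. The function name count_components is an assumption for this example, and two-digit ring numbers written with '%' are not handled.

```python
def count_components(smiles):
    """Count the disconnected structures in a SMILES string."""
    parent = []   # union-find forest over atom indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    prev = None   # atom the next atom will be bonded to
    stack = []    # branch points
    rings = {}    # open ring closure digit -> atom index
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        # Work out whether an atom starts here and how wide it is.
        if ch == "[":
            width = smiles.index("]", i) - i + 1
        elif smiles[i:i + 2] in ("Cl", "Br"):
            width = 2
        elif ch.isalpha():
            width = 1
        else:
            width = 0
        if width:
            parent.append(len(parent))
            current = len(parent) - 1
            if prev is not None:
                union(prev, current)
            prev = current
            i += width
            continue
        if ch == ".":
            prev = None              # the next atom is not bonded to this one
        elif ch == "(":
            stack.append(prev)
        elif ch == ")":
            prev = stack.pop()
        elif ch.isdigit():
            number = int(ch)
            if number in rings:
                union(rings.pop(number), prev)
            else:
                rings[number] = prev
        # Bond symbols need no action for counting components.
        i += 1
    return len({find(x) for x in range(len(parent))})


print(count_components("[Na+].[O-]c1ccccc1"))
print(count_components("c1cc([O-].[Na+])ccc1"))
print("c1cc([O-].[Na+])ccc1".split("."))   # naive split: broken fragments
```

Both spellings of sodium phenoxide contain two components, although splitting the second at the full stop leaves two pieces that are not valid SMILES on their own.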
-
-
- Isomerism
- =========
-
- The stereochemistry at chiral centres can be specified in SMILES. The
- chiral atom should be enclosed in square brackets with either one or two '@'
- symbols following it. One '@' means that, viewed from the atom written just
- before the chiral atom, the branches that follow it in the SMILES string occur
- in an anticlockwise arrangement. Two '@' symbols mean the branches occur in a
- clockwise arrangement. This is undoubtedly totally unclear, so here is an
- example
-
- OC(=O)[C@@]([H])(N)Cc1ccc(O)cc1 L-Tyrosine
-
- Here the [H], N and Cc1ccc(O)cc1 are arranged clockwise when viewed along
- the bond from the carboxyl group, OC(=O), to the chiral carbon atom. The
- other isomer is
-
- OC(=O)[C@]([H])(N)Cc1ccc(O)cc1 D-Tyrosine
-
- which has the three groups in an anticlockwise arrangement. These can be
- written more simply as
-
- OC(=O)[C@@H](N)Cc1ccc(O)cc1 L-Tyrosine
-
- OC(=O)[C@H](N)Cc1ccc(O)cc1 D-Tyrosine
-
- As an explanation for the use of the '@' symbol, and as an aid to
- remembering which is which: an '@' symbol is an 'a' with an anticlockwise
- circle around it.
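
One consequence of this rule is that the '@' or '@@' mark depends on the order in which the neighbours of the chiral atom are written. Swapping two neighbours turns '@' into '@@' and back again, so an even number of swaps leaves the mark unchanged. The sketch below works this out for a reordered neighbour list. The function names and the labels "COOH", "H", "N" and "R" (standing for the Cc1ccc(O)cc1 side chain) are made up for this example. Each list starts with the atom written before the chiral atom, followed by the others in the order they appear.

```python
def permutation_is_odd(old, new):
    """True if turning `old` into `new` takes an odd number of swaps."""
    positions = [old.index(item) for item in new]
    inversions = sum(
        1
        for a in range(len(positions))
        for b in range(a + 1, len(positions))
        if positions[a] > positions[b]
    )
    return inversions % 2 == 1


def reordered_tag(tag, old, new):
    """Return the '@' or '@@' mark needed after rewriting the neighbour order."""
    if sorted(old) != sorted(new):
        raise ValueError("both lists must contain the same neighbours")
    flipped = {"@": "@@", "@@": "@"}
    return flipped[tag] if permutation_is_odd(old, new) else tag


# L-Tyrosine as written above: OC(=O)[C@@]([H])(N)Cc1ccc(O)cc1
written = ["COOH", "H", "N", "R"]

# Starting from the nitrogen instead: N[C??H](Cc1ccc(O)cc1)C(=O)O
print(reordered_tag("@@", written, ["N", "H", "R", "COOH"]))

# Swapping just the last two branches: OC(=O)[C??]([H])(Cc1ccc(O)cc1)N
print(reordered_tag("@@", written, ["COOH", "H", "R", "N"]))
```

The first rewrite is an even permutation, so L-Tyrosine keeps '@@' when written starting from the nitrogen. The second is a single swap, so the same molecule needs '@'.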
-
- The cis/trans isomerism of double bonds can also be specified. The
- symbols '/' and '\' are used for this, written before and/or after the atoms
- which are doubly bonded. For example
-
- Cl\C=C/Cl           cis dichloro-ethene
-
- Cl\C=C\Cl           trans dichloro-ethene
-
- or for a double bond with two groups at each end
-
- Cl/C(Br)=C(/I)F
-
- This will have Cl and I trans to each other, with the Br at the same end
- as the Cl, and the F at the same end as the I.
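
For simple cases like the ones above, the two marks can be compared directly: matching marks on either side of the double bond mean trans, differing marks mean cis. The sketch below applies this. The function name double_bond_geometry is an assumption for this example, and it only covers strings with one double bond, one mark on each side, and the left-hand mark written before its doubly bonded atom (as in Cl/C=...) rather than inside a branch after it.

```python
def double_bond_geometry(smiles):
    """Classify a SMILES with one marked double bond as 'cis' or 'trans'."""
    left, sep, right = smiles.partition("=")
    if not sep or "=" in right:
        raise ValueError("expected exactly one double bond")

    before = [ch for ch in left if ch in "/\\"]
    after = [ch for ch in right if ch in "/\\"]
    if len(before) != 1 or len(after) != 1:
        raise ValueError("expected one '/' or '\\' on each side of '='")

    # Assumption: the left-hand mark comes before its doubly bonded atom,
    # not inside a branch after it (C(/Cl)=... would need the mark reversed).
    if "(" in left[:left.index(before[0])]:
        raise ValueError("left-hand mark inside a branch is not handled")

    # Same mark on both sides: opposite sides of the bond. Different: same side.
    return "trans" if before[0] == after[0] else "cis"


print(double_bond_geometry(r"Cl\C=C/Cl"))
print(double_bond_geometry(r"Cl\C=C\Cl"))
print(double_bond_geometry("Cl/C(Br)=C(/I)F"))   # geometry of Cl relative to I
```

The last call reports the relationship between the two marked groups, Cl and I, which agrees with the description above.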
-
-
- Using the above rules almost all organic structures can be written in
- SMILES notation. As a final demonstration, even a complicated molecule such
- as morphine
-
- O1C2C(O)C=CC3C2(C4)c5c1c(O)ccc5CC3N(C)C4
-
- can be written in a simple (it is when you get used to it!) and concise
- way.
-
-
- ---
- Simon Kilvington, 3/11/94
-